A bit array (also known as bit map, bit set, bit string, or bit vector) is an array data structure that compactly stores . It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation.
A finite binary relation may be represented by a bit array called a logical matrix. In the calculus of relations, these arrays are composed with matrix multiplication where the arithmetic is Boolean, and such a composition represents composition of relations.Irving Copilowish (December 1948) "Matrix development of the calculus of relations", Journal of Symbolic Logic 13(4): 193–203 Jstor link
Use OR to set a bit to one:
11101'''0'''10
OR 00000'''1'''00
= 11101'''1'''10
AND to set a bit to zero:
111010'''1'''0
AND 111111'''0'''1
= 111010'''0'''0
AND to determine if a bit is set, by zero-testing:
1110101'''0'''
AND 0000000'''1'''
= 0000000'''0''' (0 means bit is not set)
111010'''1'''0
AND 000000'''1'''0
= 000000'''1'''0 (nonzero means bit is set)
XOR to invert or toggle a bit:
11101'''0'''10
XOR 00000'''1'''00
= 11101'''1'''10
11101'''1'''10
XOR 00000'''1'''00
= 11101'''0'''10
NOT to invert all bits:
NOT 10110010
= 01001101
To obtain the bit mask needed for these operations, we can use a bit shift operator to shift the number 1 to the left by the appropriate number of places, as well as bitwise negation if necessary.
Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic difference using n/ w simple bit operations each (2 n/ w for difference), as well as the complement of either:
'''for''' i '''from''' 0 '''to''' n/w-1
complement_a[i] := '''not''' a[i]
union[i] := a[i] '''or''' b[i]
intersection[i] := a[i] '''and''' b[i]
difference[i] := a[i] '''and''' ('''not''' b[i])
If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops through each word, one at a time. Only n/ w memory accesses are required:
'''for''' i '''from''' 0 to n/w-1
index := 0 // if needed
word := a[i]
'''for''' b '''from''' 0 to w-1
value := word '''and''' 1 ≠ 0
word := word shift right 1
// do something with value
index := index + 1 // if needed
Both of these code samples exhibit ideal locality of reference, which will subsequently receive large performance boost from a data cache. If a cache line is k words, only about n/ wk cache misses will occur.
The last operation can be written ((x&0x55555555) << 1) | (x&0xaaaaaaaa) >> 1)).
Bit arrays are used for , where the bit at index k is set if and only if k is in the queue; this data structure is used, for example, by the Linux kernel, and benefits strongly from a find-first-zero operation in hardware.
Bit arrays can be used for the allocation of memory pages, , disk sectors, etc. In such cases, the term bitmap may be used. However, this term is frequently used to refer to raster graphics, which may use multiple color depth.
Another application of bit arrays is the Bloom filter, a probabilistic set data structure that can store large sets in a small space in exchange for a small probability of error. It is also possible to build probabilistic based on bit arrays that accept either false positives or false negatives.
Bit arrays and the operations on them are also important for constructing succinct data structures, which use close to the minimum possible space. In this context, operations like finding the nth 1 bit or counting the number of 1 bits up to a certain position become important.
Bit arrays are also a useful abstraction for examining streams of data compression data, which often contain elements that occupy portions of bytes or are not byte-aligned. For example, the compressed Huffman coding representation of a single 8-bit character can be anywhere from 1 to 255 bits long.
In information retrieval, bit arrays are a good representation for the of very frequent terms. If we compute the gaps between adjacent values in a list of strictly increasing integers and encode them using unary coding, the result is a bit array with a 1 bit in the nth position if and only if n is in the list. The implied probability of a gap of n is 1/2 n. This is also the special case of Golomb coding where the parameter M is 1; this parameter is only normally selected when , or roughly the term occurs in at least 38% of documents.
import (
"bufio"
"fmt"
"math/bits"
"os"
)
// bitsetSize is the number of bytes needed for 2^32 bits (512 MiB) const bitsetSize = 1 << 29
func main() {
file, err := os.Open("ip_addresses")
if err != nil {
fmt.Println("Error opening file:", err)
return
}
defer file.Close()
bitset := [bitsetSize]byte{}
// Use a buffered scanner with a larger buffer
scanner := bufio.NewScanner(file)
const maxBuffer = 64 * 1024 // 64 KB buffer
buf := make([]byte, 0, maxBuffer)
scanner.Buffer(buf, maxBuffer)
// Process each line
for scanner.Scan() {
line := scanner.Bytes()
// Parse the IP address manually from bytes
ip := parseIPv4(line)
// Set the bit
byteIndex := ip >> 3 // Divide by 8
bitIndex := ip & 7 // Bit position 0-7
bitset[byteIndex] |= 1 << bitIndex
}
// Check for scanning errors
if err := scanner.Err(); err != nil {
fmt.Println("Error reading file:", err)
return
}
var count uint64
for i := 0; i < bitsetSize; i++ {
count += uint64(bits.OnesCount8(bitset[i]))
}
fmt.Println("Number of unique IPv4 addresses:", count)
}
func parseIPv4(line byte) (ip uint32) {
i := 0
// Octet 1
n := uint32(line[i] - '0')
for i = 1; line[i] != '.'; i++ {
n = n*10 + uint32(line[i]-'0')
}
ip |= n << 24
i++ // Skip the dot
// Octet 2
n = uint32(line[i] - '0')
i++
for ; line[i] != '.'; i++ {
n = n*10 + uint32(line[i]-'0')
}
ip |= n << 16
i++ // Skip the dot
// Octet 3
n = uint32(line[i] - '0')
i++
for ; line[i] != '.'; i++ {
n = n*10 + uint32(line[i]-'0')
}
ip |= n << 8
i++ // Skip the dot
// Octet 4
n = uint32(line[i] - '0')
i++
for ; i < len(line); i++ {
n = n*10 + uint32(line[i]-'0')
}
ip |= n
return ip
}
The APL programming language fully supports bit arrays of arbitrary shape and size as a Boolean datatype distinct from integers. All major implementations (Dyalog APL, APL2, APL Next, NARS2000, Gnu APL, etc.) pack the bits densely into whatever size the machine word is. Bits may be accessed individually via the usual indexing notation (for example, A[3]) as well as through all of the usual primitive functions and operators where they are often operated on using a special case algorithm such as summing the bits via a table lookup of bytes.
The C programming language's , pseudo-objects found in structs with size equal to some number of bits, are in fact small bit arrays; they are limited in that they cannot span words. Although they give a convenient syntax, the bits are still accessed using bytewise operators on most machines, and they can only be defined statically (like C's static arrays, their sizes are fixed at compile-time). It is also a common idiom for C programmers to use words as small bit arrays and access bits of them using bit operators. A widely available header file included in the X11 system, <xtrapbits.h>, is “a portable way for systems to define bit field manipulation of arrays of bits.” A more explanatory description of aforementioned approach can be found in the comp.lang.c faq.
In C++, although individual bools typically occupy the same space as a byte or an integer, the STL type std::vector<bool> is a partial template specialization in which bits are packed as a space efficiency optimization. Since bytes (and not bits) are the smallest addressable unit in C++, the [] operator does not return a reference to an element, but instead returns a Proxy pattern. This might seem a minor point, but it means that vector<bool> is not a standard STL container, which is why the use of vector<bool> is generally discouraged. Another unique STL class, std::bitset, creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set. Like , the size of bitset must be specified at compile time, and cannot be inferred by the compiler. The Boost C++ Libraries provides a boost::dynamic_bitset class whose size is specified at run-time.
The D programming language provides bit arrays in its standard library, Phobos, in std.bitmanip. As in C++, the operator does not return a reference, since individual bits are not directly addressable on most hardware, but instead returns a bool.
In Java, the class (java.util.BitSet) creates a bit array that is then manipulated with functions named after bitwise operators familiar to C programmers. Unlike the bitset in C++, the Java BitSet does not have a "size" state (it has an effectively infinite size, initialized with 0 bits); a bit can be set or tested at any index. In addition, there is a class (java.util.EnumSet), which represents a Set of values of an enumerated type internally as a bit vector, as a safer alternative to bit fields.
The .NET Framework supplies a Systems.Collections.BitArray collection class. It stores bits using an array of type int (each element in the array usually represents 32 bits). The class supports random access and bitwise operators, can be iterated over, and its Length property can be changed to grow or truncate it.
Although Standard ML has no support for bit arrays, Standard ML of New Jersey has an extension, the BitArray structure, in its SML/NJ Library. It is not fixed in size and supports set operations and bit operations, including, unusually, shift operations.
Haskell likewise currently lacks standard support for bitwise operations, but both GHC and Hugs provide a Data.Bits module with assorted bitwise functions and operators, including shift and rotate operations and an "unboxed" array over Boolean values may be used to model a Bit array, although this lacks support from the former module.
In Perl, strings can be used as expandable bit arrays. They can be manipulated using the usual bitwise operators (~ | & ^), and individual bits can be tested and set using the vec function.
In Ruby, one can access (but not set) a bit of an integer (Fixnum or Bignum) using the bracket operator ([]), as if it were an array of bits.
Rust supports using native numeric types as fixed-width bit arrays using bitwise operations. Third party libraries like bitvec provide dedicated interfaces.
Apple's Core Foundation library contains CFBitVector and CFMutableBitVector structures.
PL/I supports arrays of bit strings of arbitrary length, which may be either fixed-length or varying. The array elements may be aligned— each element begins on a byte or word boundary— or unaligned— elements immediately follow each other with no padding.
PL/pgSQL and PostgreSQL's SQL support bit strings as native type. There are two SQL bit types: bit(''<code>n ) and bit varying(''<code>n), where n is a positive integer.
Hardware description languages such as VHDL, Verilog, and SystemVerilog natively support bit vectors as these are used to model storage elements like flip-flops, hardware busses and hardware signals in general. In hardware verification languages such as OpenVera, e and SystemVerilog, bit vectors are used to sample values from the hardware models, and to represent data that is transferred to hardware during simulations.
Common Lisp provides multi-dimensional bit arrays. A one-dimensional bit-vector implementation is provided as a special case of the built-in array, acting in a dual capacity as a class and a type specifier. Bit arrays (and thus bit vectors) relies on the general make-array function to be configured with an element type of bit, which optionally permits a bit vector to be designated as dynamically resizable. The bit-vector, however, is not infinite in extent. A more restricted simple-bit-vector type exists, which explicitly excludes the dynamic characteristics. Bit vectors are represented as, and can be constructed in a more concise fashion by, the reader macro #*<i>bits</i>. In addition to the general functions applicable to all arrays, dedicated operations exist for bit arrays. Single bits may be accessed and modified using the bit and sbit functions and an extensive number of logical operations is supported.
|
|